Working With Data Loaders in ApertureDB

DataLoaders are a tool for ingesting large amounts of data into ApertureDB.

# First, open the CSV files to see what data they contain
with open("input/people.adb.csv") as f:
    print(f.read())

with open("input/images.adb.csv") as f:
    print(f.read())

with open("input/connections-people-images.adb.csv") as f:
    print(f.read())

from aperturedb import Connector, Utils
from aperturedb import EntityDataCSV, ConnectionDataCSV
from aperturedb import ImageDataCSV, ParallelLoader

# ApertureDB Server Info for establishing connection
db_host = "aperturedb.local" # assuming local installation as provided
user = "admin" # requires authentication
password = "admin" # use the password provided for the instance

# These variables control how data is loaded into ApertureDB
batchsize = 100
numthreads = 32
stats = True

DataLoaders are split into two pieces.

CSV files are read by CSV parsers, which convert rows into commands and feed them to a Loader.

These two abstractions work together to provide a powerful and modular system for loading data into ApertureDB, as the schematic below shows.
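
Every load in this tutorial follows the same two-step pattern. The sketch below is schematic only; "SomeDataCSV" and "some_file.adb.csv" are placeholder names, not real classes or files.

# Schematic of the parser/loader pattern used throughout this tutorial.
# "SomeDataCSV" and "some_file.adb.csv" are placeholders for illustration only.
#
# generator = SomeDataCSV.SomeDataCSV("some_file.adb.csv")  # parses CSV rows into queries
# loader = ParallelLoader.ParallelLoader(db)                # runs the queries in parallel batches
# loader.ingest(generator, batchsize=batchsize, numthreads=numthreads, stats=stats)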

Loading Entities

Many datasets contain information that is not directly tied to binary data. Such records are representable as Entities.
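
To make the parser's expectations concrete, the snippet below writes a tiny entity CSV. It is a minimal sketch assuming the documented EntityDataCSV layout, where the first column names the entity class and the remaining columns become properties; the class name, property names, values, and the file name sample-people.adb.csv are all made up for illustration (the real columns are whatever people.adb.csv printed above).

# A minimal sketch of the EntityDataCSV layout (assumed, not taken from people.adb.csv):
# the first column names the entity class, the remaining columns become properties.
sample = """EntityClass,name,age
Person,Alice,34
Person,Bob,27
"""

# Hypothetical file name, used only to illustrate the format.
with open("input/sample-people.adb.csv", "w") as f:
    f.write(sample)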

Here we show a simple example of how to load a CSV of entities.

# First connect to the DB and get an instance
from aperturedb import Connector

db = Connector.Connector(db_host, user=user, password=password)


# Next we ensure the database is empty.
utils = Utils.Utils(db)
utils.remove_all_objects()

in_csv_file = "./input/people.adb.csv"

# First we choose a csv loader which corresponds to the data in the file.
# EntityDataCSV is the loader to use when loading an Entity,
# which has properties, but no associated binary data.
generator = EntityDataCSV.EntityDataCSV(in_csv_file)

# Then we pass the csv loader to the aperture loader
# which connects to the database and ensures the data is loaded

# We are using the variables from above to control how the
# parallelization occurs: numthreads sets how many threads run in
# parallel, and batchsize sets how many items a thread loads at a time.
loader = ParallelLoader.ParallelLoader(db)
loader.ingest(generator, batchsize=batchsize,
              numthreads=numthreads,
              stats=stats)

utils.summary()

Loading Images

Images are an extremely common part of datasets. Loading them is very similar to loading entities.
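
For reference, here is a sketch of what an image CSV might look like. It assumes the documented ImageDataCSV layout, where the first column points at the image (for example a local file path) and the remaining columns become properties of the Image object; the paths and property names are invented for illustration, and the images.adb.csv printed at the top is the authoritative example.

# A minimal sketch of the ImageDataCSV layout (assumed, not taken from images.adb.csv):
# the first column locates the image file, the remaining columns become properties.
sample = """filename,id,caption
input/photos/0001.jpg,1,A person on a bridge
input/photos/0002.jpg,2,A person at a desk
"""

# Hypothetical file name and paths, used only to illustrate the format.
with open("input/sample-images.adb.csv", "w") as f:
    f.write(sample)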

# Images.
in_csv_file = "./input/images.adb.csv"

# For an Image that has associated properties, ImageDataCSV is used.
generator = ImageDataCSV.ImageDataCSV(in_csv_file)

# It is loaded in the same way as an entity, because the CSV parser has
# already converted the CSV data into a format that ApertureDB understands.
loader = ParallelLoader.ParallelLoader(db)
loader.ingest(generator, batchsize=batchsize,
              numthreads=numthreads,
              stats=stats)

utils.summary()

Loading Connections

Connections are relationships between entities, images, videos, and all other blob-backed objects. Loading connections looks very similar.
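
Again as a hedged sketch of the input format: a ConnectionDataCSV file is expected to start with a ConnectionClass column, followed by two columns that identify the already-loaded source and destination objects by a unique property (written as Class@property), with any remaining columns becoming properties of the connection. The class names, property names, and header syntax below are illustrative assumptions; the connections-people-images.adb.csv printed at the top is the authoritative example.

# A minimal sketch of the ConnectionDataCSV layout (assumed, not taken from the real file):
# ConnectionClass, then a Class@property column for each endpoint, then connection properties.
sample = """ConnectionClass,Person@id,_Image@id,since
has_photo,1,1,2020
has_photo,2,2,2021
"""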

in_csv_file = "./input/connections-people-images.adb.csv"

# A connection is slightly different in that it links two items that
# already exist, so it must identify the items to be connected.
generator = ConnectionDataCSV.ConnectionDataCSV(in_csv_file)

loader = ParallelLoader.ParallelLoader(db)

loader.ingest(generator, batchsize=batchsize,
              numthreads=numthreads,
              stats=stats)

utils.summary()

Summary

After the DataLoaders run, we can verify that properties have been added.

from aperturedb import Connector

db = Connector.Connector(db_host, user=user, password=password)

# Now we will retrieve the schema to show how the database has stored the
# CSV input
query = [{
    "GetSchema": {}
}]

# Schema API explained here https://docs.aperturedata.io/query_language/Reference/db_commands/GetSchema

response, arr = db.query(query)
db.print_last_response()
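
Beyond the schema, we can pull back a few of the loaded objects to confirm their properties. The query below is a minimal sketch that assumes the people CSV created entities of class Person; replace the class name with whatever people.adb.csv actually uses.

# Fetch a handful of entities and return all of their properties.
# "Person" is an assumed class name; adjust it to match people.adb.csv.
query = [{
    "FindEntity": {
        "with_class": "Person",
        "results": {
            "limit": 5,
            "all_properties": True
        }
    }
}]

response, arr = db.query(query)
db.print_last_response()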